### **Review of Von Neumann Architectures**

## A little history... programs

#### Stored program model has been around for a long time...





## **R-type Instructions**

- All instructions have 3 operands
- All operands must be registers
- Operand order is fixed (destination first)

**Example:** 

**C code:** A = B - C;

(Assume that A, B, C are stored in registers s0, s1, s2.)

MIPS code: sub \$s0, \$s1, \$s2

Machine code: 000000 10001 10010 10000 xxxxx 100010

**Other R-type instructions** 

addu, mult, and, or, sll, srl, …
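The encoding of the `sub` example above can be checked with a small sketch. It assumes the standard MIPS R-type field layout (opcode, rs, rt, rd, shamt, funct) and standard register numbers; the helper names `encode_r_type` and `fields` are made up for illustration, and the don't-care shamt field (`xxxxx`) is filled with zeros.

```python
# Register numbers for the registers used in the example (standard MIPS numbering).
REGISTERS = {"$s0": 16, "$s1": 17, "$s2": 18}

# funct codes for a few R-type instructions (the opcode field is 0 for all of them).
FUNCT = {"add": 0b100000, "addu": 0b100001, "sub": 0b100010,
         "and": 0b100100, "or": 0b100101}


def encode_r_type(op, rd, rs, rt, shamt=0):
    """Pack the R-type fields into one 32-bit word: opcode | rs | rt | rd | shamt | funct."""
    opcode = 0
    return (
        (opcode << 26)
        | (REGISTERS[rs] << 21)
        | (REGISTERS[rt] << 16)
        | (REGISTERS[rd] << 11)
        | (shamt << 6)
        | FUNCT[op]
    )


def fields(word):
    """Render a 32-bit word as its six space-separated R-type bit fields."""
    bits = format(word, "032b")
    return " ".join([bits[0:6], bits[6:11], bits[11:16],
                     bits[16:21], bits[21:26], bits[26:32]])


# sub $s0, $s1, $s2  (destination first in assembly, but rd is the fourth field)
print(fields(encode_r_type("sub", "$s0", "$s1", "$s2")))
```

Note that the assembly operand order (destination first) differs from the field order in the machine word, where the two source registers come before the destination.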



How about loading the next word in memory?



## An important idea...

 $\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time}$ 

#### We can see CPU performance is dependent on:

- Clock rate, CPI, and instruction count

#### CPU time is directly proportional to all 3 terms (instruction count, CPI, and clock cycle time, the inverse of clock rate):

- Therefore an x % reduction in any one term leads to an x % reduction in CPU time
- But, everything usually affects everything:



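As a worked instance of the CPU time equation, here is a minimal sketch. The workload numbers (1 billion instructions, CPI of 2.0, 1 GHz clock) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: instructions x (cycles / instruction) x (seconds / cycle)."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical workload: 1 billion instructions, CPI of 2.0, 1 GHz clock.
baseline = cpu_time(1_000_000_000, 2.0, 1_000_000_000)

# Cut CPI by 10% while holding instruction count and clock rate fixed.
lower_cpi = cpu_time(1_000_000_000, 1.8, 1_000_000_000)

print(f"baseline: {baseline:.2f} s, lower CPI: {lower_cpi:.2f} s")
print(f"CPU time reduced by {100 * (1 - lower_cpi / baseline):.1f}%")
```

In practice the three terms are rarely independent: a change that lowers CPI may lengthen the clock cycle or raise the instruction count, so each term has to be re-measured after a design change.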

## **Pipelining Lessons (laundry example)**



- <u>Multiple</u> tasks operating simultaneously
- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Pipeline rate limited by <u>slowest</u> pipeline stage
- Potential speedup = <u>Number pipe stages</u>
- Unbalanced lengths of pipe stages reduce speedup
- Also, need time to "<u>fill</u>" and "<u>drain</u>" the pipeline.

Pipelining can add overhead. But what if it were free? What if the pipeline were really deep?
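The lessons above can be put into numbers with a rough sketch. It assumes a clocked pipeline in which every stage is given the time of the slowest stage, and it ignores any per-stage overhead; the stage times (30, 40, 20 minutes for wash, dry, fold) are illustrative values, not taken from the slides.

```python
def unpipelined_time(stage_times, n_tasks):
    """Each task runs start to finish before the next one begins."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """One slot per stage to fill the pipeline, then one task finishes per slot.

    The slot length is set by the slowest stage.
    """
    slot = max(stage_times)
    return (len(stage_times) + n_tasks - 1) * slot


unbalanced = [30, 40, 20]   # assumed wash / dry / fold times in minutes
balanced = [30, 30, 30]     # same total work, evenly split

for stages in (unbalanced, balanced):
    for n in (4, 1000):
        speedup = unpipelined_time(stages, n) / pipelined_time(stages, n)
        print(stages, "tasks:", n, "speedup:", round(speedup, 2))
```

With balanced stages and many tasks, the speedup approaches the number of stages; with few tasks the fill and drain time dominates, and with unbalanced stages the slowest stage caps the rate.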

#### Lecture 07 - Review of Von Neumann Architectures

## **Example: Nanomagnetics**

[Figure: schematic and device views of the wire, gate, and inverter (with inverted output)]





R. Cowburn, M. Welland, "Room temperature magnetic *quantum cellular automata,*" Science **287**, 1466, 2000.

A. Imre, "Experimental Study of Nanomagnets for Magnetic QCA Logic Applications," U. of Notre Dame, Ph.D. Dissertation.



A. Imre, et al., "Majority logic gate for Magnetic Quantum-Dot Cellular Automata," Science, vol. 311, no. 5758, pp. 205–208, January 13, 2006.

A. Imre, et al., "Magnetic Logic Devices Based on Field-Coupled Nanomagnets," NanoGiga 2007.

### **Example: Nanomagnetics**





## Single-cycle diagrams: cycle 4



# **Data hazard specifics**

- There are actually 3 different kinds of data hazards!
  - Read After Write (RAW)
  - Write After Write (WAW)
  - Write After Read (WAR)
- We'll discuss/illustrate each on forthcoming slides. However, 1<sup>st</sup> a note on convention.
  - Discussion of hazards will use generic instructions i & j.
  - i is always issued before j.
  - Thus, i will always be further along in pipeline than j.
- With an in-order issue/in-order completion machine, we're not as concerned with WAW, WAR
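Using the i/j convention above, the three cases can be told apart purely from register names. The sketch below is illustrative: `Instr` and `data_hazards` are hypothetical names, and strictly speaking it reports dependences that could become hazards, since whether one actually occurs depends on the pipeline.

```python
from collections import namedtuple

# dest: register written (or None); srcs: set of registers read.
Instr = namedtuple("Instr", ["dest", "srcs"])


def data_hazards(i, j):
    """Return the potential data hazards between i and j, where i is issued before j."""
    hazards = []
    if i.dest is not None and i.dest in j.srcs:
        hazards.append("RAW")   # j reads what i writes
    if i.dest is not None and i.dest == j.dest:
        hazards.append("WAW")   # j writes what i writes
    if j.dest is not None and j.dest in i.srcs:
        hazards.append("WAR")   # j writes what i reads
    return hazards


add = Instr("r1", {"r2", "r3"})   # add r1, r2, r3
sub = Instr("r4", {"r1", "r5"})   # sub r4, r1, r5
print(data_hazards(add, sub))     # sub reads r1, which add writes
print(data_hazards(sub, add))     # add writes r1, which sub reads
```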

# **Read after write (RAW) hazards**

- With RAW hazard, instruction j tries to read a source operand before instruction i writes it.
- Thus, j would incorrectly receive an old or incorrect value
- **Graphically/Example:**



- Can use stalling or forwarding to resolve this hazard

# **Branch/Control Hazards**

- So far, we've limited discussion of hazards to:
  - Arithmetic/logic operations
  - Data transfers
- Also need to consider hazards involving branches:

- Example:
  - 40: beq \$1, \$3, \$28 # (\$28 gives address 72)
  - 44: and \$12, \$2, \$5
  - 48: or \$13, \$6, \$2
  - 52: add \$14, \$2, \$2
  - 72: lw \$4, 50(\$7)
- How long will it take before the branch decision takes effect?
  - What happens in the meantime?

# Pipelining and ILP

- Pipelining provides for some instruction level parallelism
  - (multiple instructions executing at the same time)
- Hazards hurt ILP
  - (sometimes we have to stall the pipeline and wait b/c of instruction/data dependencies)
- Dynamic scheduling (next topic) might help...

### **Dynamic Scheduling: Motivation**

| Instruction                  | 1 | 2 | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 |
|------------------------------|---|---|----|----|----|----|----|----|----|----|
| divf f0,f2,f4                | F | D | E/ | E/ | E/ | E/ | W  |    |    |    |
| addf f6,<mark>f0</mark>,f2   |   | F | D  | d* | d* | d* | E+ | E+ | W  |    |
| mulf f8,f2,f4                |   |   | F  | p* | p* | p* | D  | E* | E* | W  |

- cycle 4: addf stalls due to RAW hazard
  - OK, fundamental problem
- also cycle 4: mulf stalls due to *pipeline hazard* (addf stalls)
  - why? **mulf** can't proceed into ID because **addf** is there
  - but that's the only reason $\Rightarrow$ not good enough!
- why can't we decode mulf in cycle 4 and execute it in cycle 5?
  - no fundamental reason why we can't do this!

**University of Notre Dame**

© 2007 by Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti ECE 252 / CPS 220 Lecture Notes Dynamic Scheduling I

### Scheduling

scheduling: re-arranging instructions to maximize performance

- requires knowledge about structure of processor
- requires knowledge about latencies and dependences

two options for who should schedule instructions

- static scheduling: by compiler
- dynamic scheduling: by hardware


# Scheduling

- Finds instructions to execute in each cycle
  - Static (in-order) scheduling: looks only at the next instruction
  - Dynamic (out-of-order) scheduling: looks at a "window" of instructions
- How many instructions are we looking for?
  - 3-4 is typical today, 8 is in the works
  - A CPU that can ideally do N instrs per cycle is called "N-way superscalar", "N-issue superscalar", or simply "N-way" or "N-issue".

## **Static Scheduling**

- Cycle 1
  - Start I1.
  - Can we also start I2? No.
- Cycle 2
  - Start I2.
  - Can we also start I3? Yes.
  - Can we also start I4? No.
- If the next instruction cannot start, the scheduler stops looking for things to do in this cycle!



# **Dynamic Scheduling**

- Cycle 1
  - Operands ready? I1, I5.
  - Start I1, I5.
- Cycle 2
  - Operands ready? I2, I3.
  - Start I2,I3.
- Window size (W): how many instructions ahead do we look.
  - Do not confuse with "issue width" (N).
  - E.g. a 4-issue out-of-order processor can have a 128-entry window (it can look at the next 128 instructions).
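The difference between the two policies can be sketched as a toy scheduler. The program from the slide's figure is not reproduced here, so the dependence pattern below is an assumption, chosen only because it is consistent with the cycle-by-cycle answers on the static and dynamic scheduling slides. Every instruction is assumed to take one cycle, and `schedule` is a hypothetical helper, not a model of any real processor.

```python
def schedule(deps, issue_width, window, in_order):
    """Return one list per cycle naming the instructions started in that cycle.

    deps maps each instruction (in program order) to the set of earlier
    instructions whose results it needs. A result is usable in the cycle
    after its producer starts.
    """
    program = list(deps)
    started_in = {}   # instruction -> cycle in which it started
    cycles = []
    cycle = 0
    while len(started_in) < len(program):
        cycle += 1
        pending = [name for name in program if name not in started_in]
        issued = []
        for name in pending[:window]:           # only look W instructions ahead
            if len(issued) == issue_width:      # at most N starts per cycle
                break
            ready = all(started_in.get(d, cycle) < cycle for d in deps[name])
            if ready:
                issued.append(name)
            elif in_order:
                break   # static: stop at the first instruction that cannot start
        for name in issued:
            started_in[name] = cycle
        cycles.append(issued)
    return cycles


# Assumed dependences: I2 and I3 need I1, I4 needs I3, I5 is independent.
deps = {"I1": set(), "I2": {"I1"}, "I3": {"I1"}, "I4": {"I3"}, "I5": set()}

print(schedule(deps, issue_width=4, window=4, in_order=True))
print(schedule(deps, issue_width=4, window=5, in_order=False))
```

With this particular pattern both policies happen to finish in three cycles; what changes is that the out-of-order scheduler starts I5 in cycle 1 instead of waiting behind I2. Shrinking `window` in the out-of-order case hides I5 from the scheduler and removes that benefit.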




## **Register Renaming**

- Solution: give I3 some other name (e.g. S) for the value it produces.
- But I4 uses that value, so we must also change that to S...



- In fact, all uses of R5 from I3 to the next instruction that writes to R5 again must now be changed to S!
- We get rid of output dependences in the same way: change R2 in I5 (and subsequent instrs) to T.
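The renaming rule described above can be sketched as a single pass over the program. The five-instruction program below is hypothetical (the slide's own program is in a figure that is not reproduced here), and the fresh names `P0`, `P1`, ... play the role of S and T.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh name and point later reads at the newest name.

    program is a list of (dest, src1, src2) tuples of register names.
    """
    fresh = (f"P{n}" for n in count())
    current = {}    # architectural register -> newest name holding its value
    renamed = []
    for dest, *srcs in program:
        new_srcs = [current.get(s, s) for s in srcs]   # read sources before remapping dest
        current[dest] = next(fresh)
        renamed.append((current[dest], *new_srcs))
    return renamed


program = [
    ("R5", "R1", "R2"),   # I1 writes R5
    ("R3", "R5", "R2"),   # I2 reads R5 and R2
    ("R5", "R6", "R7"),   # I3 writes R5 again (WAR with I2, WAW with I1)
    ("R8", "R5", "R1"),   # I4 reads the R5 produced by I3
    ("R2", "R6", "R6"),   # I5 writes R2 (WAR with I1 and I2)
]

for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, no two instructions write the same name and no instruction writes a name that an earlier instruction reads, so only the true (RAW) dependences remain.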

## **Overhead of dynamic scheduling**

- Need excess state that keeps track of renames, etc.



### **R10K Pipeline**

Same pipeline structure: IF, DS, IS, EX, CM, RT

Red = different than before

- DS (dispatch)
  - (RS or ROB or MOB full or no physical registers) ? (stall) : *(if structural hazard)*
  - (allocate RS and ROB entries AND physical register) *(if no structural hazard)*
- IS (issue)
  - (read physical registers)
- CM (completion)
  - (writeback destination register, mark ROB entry complete)
- RT (retire, commit, graduate)
  - (ROB head not complete) ? (stall) :
  - {if store then write MOB head to D\$, handle any exceptions, free ROB/MOB entries, free previous physical register}

ECE 252 / CPS 220 Lecture Notes Dynamic Scheduling II

# R10K: Dispatch (DS)



If add R1, R2, R3 (R2 = T7, R3 = T24)

© 2007 by Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti ECE 252 / CPS 220 Lecture Notes Dynamic Scheduling II

#### R10K: Complete (CM)




#### R10K: Retire (RT)

[Figure: map table, register file (R value), free list, RS, ROB (head, tail, T, T<sub>old</sub>), CDB.T, FU, dispatch]

- T<sub>old</sub> = old mapping (needed for exceptions)
- Stall until instruction at ROB head is complete
- Return T<sub>old</sub> of ROB head to free list
- Free ROB head entry
- With above example, everything waiting on T7 has its value, as we're at head of ROB
- R1 now = T3

Therefore, if Add R1, R1, R1: R1 initially T7, R1 renamed to T3. Here, $T_{old} = T7$, T = T3.
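The life cycle of T and T<sub>old</sub> in the Add R1, R1, R1 example can be traced with a minimal sketch. It models only the map table, free list, and ROB; the RS, MOB, register values, and exception handling are left out, and the class and method names are hypothetical.

```python
class RenameState:
    """Map table, free list, and ROB for an R10K-style renaming scheme (simplified)."""

    def __init__(self, map_table, free_list):
        self.map_table = dict(map_table)   # architectural register -> physical register
        self.free_list = list(free_list)   # physical registers available for allocation
        self.rob = []                      # oldest entry first (index 0 is the ROB head)

    def dispatch(self, dest, srcs):
        """Rename one instruction; return None to signal a stall."""
        if not self.free_list:
            return None                                  # no physical register: stall
        src_tags = [self.map_table[s] for s in srcs]     # look up sources first
        t_old = self.map_table[dest]                     # previous mapping of dest
        t = self.free_list.pop(0)
        self.map_table[dest] = t
        entry = {"T": t, "T_old": t_old, "complete": False}
        self.rob.append(entry)
        return src_tags, entry

    def retire(self):
        """Retire the ROB head if it is complete; free its T_old."""
        if not self.rob or not self.rob[0]["complete"]:
            return False                                 # stall until head is complete
        entry = self.rob.pop(0)
        self.free_list.append(entry["T_old"])            # old mapping is no longer needed
        return True


state = RenameState({"R1": "T7"}, ["T3"])
src_tags, entry = state.dispatch("R1", ["R1", "R1"])     # Add R1, R1, R1
print(src_tags, entry["T"], entry["T_old"])
print(state.retire())          # head not complete yet
entry["complete"] = True       # the CM stage would set this
print(state.retire(), state.free_list, state.map_table)
```

T<sub>old</sub> can only be freed at retirement because, until the instruction is at the ROB head, an exception on an older instruction could require restoring the old mapping.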




## **Question?**

- How much of a chip is "memory"?
  - 10%
  - 25%
  - 50%
  - 75%
  - 85%

[Figure: "Some Perspective", with regions labeled 15% and 85%]

© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti COMPSCI 220 / ECE 252 Lecture Notes Storage Hierarchy I: Caches

## If I say "Memory" what do you think of?

- Memory Comes in Many Flavors
  - SRAM (Static Random Access Memory)
  - DRAM (Dynamic Random Access Memory)
  - ROM, Flash, etc.
  - Disks, Tapes, etc.
- Difference in speed, price and "size"
  - Fast is small and/or expensive
  - Large is slow and/or expensive
- The search is on for a "universal memory"
  - What's a "universal memory"?
    - Fast and non-volatile.
  - May be MRAM, PCRAM, etc. etc.

Let's start with DRAM. It's generally the largest piece of RAM.

### Is there a problem with DRAM?

#### **Processor-DRAM Memory Gap (latency)**





## Where can a block be placed in a cache?

- 3 schemes for block placement in a cache:
  - <u>Direct mapped</u> cache:
    - Block (or data to be stored) can go to only 1 place in cache
    - Usually: (Block address) MOD (# of blocks in the cache)
  - Fully associative cache:
    - Block can be placed anywhere in cache
  - <u>Set associative</u> cache:
    - "Set" = a group of blocks in the cache
    - Block mapped onto a set & then block can be placed anywhere within that set
    - Usually: (Block address) MOD (# of sets in the cache)
    - If there are n blocks in a set, we call it n-way set associative
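The three placement schemes differ only in how many blocks share a set, which a short sketch can make concrete. The cache size (8 blocks) and block address (12) are illustrative values, and numbering the frames of a set contiguiously from `set_index * ways` is an assumed layout, not something the slides specify.

```python
def candidate_frames(block_address, num_blocks, ways):
    """Return the cache frames where a memory block may be placed.

    ways == 1 gives a direct mapped cache; ways == num_blocks gives a
    fully associative cache; anything in between is set associative.
    """
    num_sets = num_blocks // ways
    set_index = block_address % num_sets     # (Block address) MOD (# of sets)
    first = set_index * ways
    return list(range(first, first + ways))


print(candidate_frames(12, num_blocks=8, ways=1))   # direct mapped: one frame
print(candidate_frames(12, num_blocks=8, ways=2))   # 2-way set associative: one set of 2
print(candidate_frames(12, num_blocks=8, ways=8))   # fully associative: any frame
```

Direct mapped and fully associative are the two extremes of the same formula: with one block per set the MOD picks a single frame, and with a single set the MOD always yields 0 and every frame is a candidate.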



## Memory access equations

- Using what we defined on previous slide, we can say:
  - Memory stall clock cycles =
    - Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty
- Often, reads and writes are combined/averaged:
  - Memory stall cycles =
    - Memory accesses × Miss rate × Miss penalty (approximation)
- Also possible to factor in instruction count to get a "complete" formula:

 $\text{CPU time} = IC \times \left( CPI_{\text{execution}} + \frac{\text{Memory stall clock cycles}}{\text{Instruction}} \right) \times \text{Clock cycle time}$ 
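A quick numeric sketch of the combined formula follows. All parameter values (1 million instructions, execution CPI of 1.0, 1.5 memory accesses per instruction, 2% miss rate, 100-cycle miss penalty, 1 ns cycle) are hypothetical, and the function names are made up for illustration.

```python
def stall_cycles_per_instruction(accesses_per_instruction, miss_rate, miss_penalty):
    """Memory stall clock cycles per instruction (combined read/write approximation)."""
    return accesses_per_instruction * miss_rate * miss_penalty


def cpu_time_with_stalls(ic, cpi_execution, stalls_per_instruction, cycle_time_s):
    """CPU time = IC x (CPI_execution + stall cycles per instruction) x clock cycle time."""
    return ic * (cpi_execution + stalls_per_instruction) * cycle_time_s


stalls = stall_cycles_per_instruction(1.5, 0.02, 100)
ideal = cpu_time_with_stalls(1_000_000, 1.0, 0.0, 1e-9)
actual = cpu_time_with_stalls(1_000_000, 1.0, stalls, 1e-9)

print(f"stall cycles per instruction: {stalls:.2f}")
print(f"slowdown from memory stalls: {actual / ideal:.2f}x")
```

Even a small miss rate matters when the penalty is large: under these assumed numbers the memory stalls contribute more cycles per instruction than execution itself.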

## **Reducing cache misses**

- Obviously, we want data accesses to result in cache hits, not misses; this will optimize performance
- Start by looking at ways to increase % of hits....

More devices = more cache to help reduce

- ...but first look at 3 kinds of misses!
  - Compulsory misses:
    - Very 1st access to cache block will not be a hit; the data's not there yet!
  - Capacity misses:
    - Cache is only so big. Won't be able to store every block accessed in a program, so some blocks must be swapped out!
  - Conflict misses:
    - Result from set-associative or direct mapped caches
    - Blocks discarded/retrieved if too many map to a location
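One common way to separate the three kinds of misses is to compare the real cache against a fully associative cache of the same size. The sketch below does this for a direct mapped cache; the LRU replacement policy for the reference cache, the block-address traces, and the name `classify_misses` are all assumptions made for illustration.

```python
from collections import OrderedDict


def classify_misses(trace, num_blocks):
    """Count hits and the three kinds of misses for a direct mapped cache.

    trace is a sequence of block addresses. A miss is compulsory if the block
    was never referenced before, capacity if a fully associative LRU cache of
    the same size would also miss, and conflict otherwise.
    """
    seen = set()          # every block referenced so far
    lru = OrderedDict()   # fully associative reference cache, oldest entry first
    direct = {}           # frame index -> block currently stored there
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        frame = block % num_blocks
        hit_direct = direct.get(frame) == block
        hit_lru = block in lru
        if hit_direct:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif not hit_lru:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        # Update both caches and the reference history.
        seen.add(block)
        direct[frame] = block
        if hit_lru:
            lru.move_to_end(block)
        else:
            if len(lru) == num_blocks:
                lru.popitem(last=False)   # evict the least recently used block
            lru[block] = True
    return counts


# Blocks 0 and 4 fight over frame 0 of a 4-block cache: conflict misses.
print(classify_misses([0, 4, 0, 4], num_blocks=4))
# Four distinct blocks cycle through a 2-block cache: a capacity miss on the reuse of 0.
print(classify_misses([0, 1, 2, 3, 0], num_blocks=2))
```

The first trace would hit on its last two accesses in a fully associative cache of the same size, which is why those misses count as conflicts rather than capacity misses.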